Unveiling the Secrets of Wine Quality¶
Bypassing traditional tasting methods, this project employs data analysis to predict the quality of wine based on its chemical features.
3. Data Analysis¶
import os

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import metrics, set_config
from sklearn.model_selection import train_test_split, KFold, StratifiedKFold, GridSearchCV
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import LinearSVR
from sklearn.linear_model import LinearRegression

from scipy.stats import zscore, pearsonr
import scipy.stats as stats
import statsmodels.api as sm
3.1 Extract Data: Reading from CSV Files¶
# These csv files are downloaded from the UCI website:
# http://archive.ics.uci.edu/dataset/186/wine+quality.
df_white_wine = pd.read_csv("raw_data/winequality-white.csv", sep=";")
df_red_wine = pd.read_csv("raw_data/winequality-red.csv", sep=";")
df_white_wine.head()
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.0010 | 3.00 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.30 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.9940 | 3.30 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.40 | 6.9 | 0.050 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.40 | 9.9 | 6 |
df_red_wine.head()
| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
wine_lists = [df_red_wine, df_white_wine]
# Attach a human-readable label to each DataFrame. Note: ad-hoc instance
# attributes like this are not preserved by most pandas operations (e.g. copy,
# filtering), which is why get_wine_str needs a fallback.
df_white_wine.wine_type = "White Wine"
df_red_wine.wine_type = "Red Wine"

def get_wine_str(wine_type_df):
    """Return the label attached to the DataFrame, or a fallback."""
    return getattr(wine_type_df, 'wine_type', "Unknown Wine")
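A more robust alternative to ad-hoc attributes (a sketch, not what this notebook uses) is `DataFrame.attrs`, pandas' built-in dictionary for dataset-level metadata:

```python
import pandas as pd

df = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 6]})
# attrs is an ordinary dict reserved for dataset-level metadata
df.attrs["wine_type"] = "Red Wine"

def get_wine_label(frame):
    # fall back gracefully when no label has been set
    return frame.attrs.get("wine_type", "Unknown Wine")

print(get_wine_label(df))              # "Red Wine"
print(get_wine_label(pd.DataFrame()))  # "Unknown Wine"
```

`attrs` is still documented as experimental, so propagation through every operation is not guaranteed either, but it survives `copy()` and avoids shadowing real DataFrame attributes.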
3.2 Transformation¶
Here we create Q-Q plots to check whether each feature follows a normal distribution.
def check_numeric_columns(wine_type_df):
    """
    Select and return the names of numeric columns from the given DataFrame.
    Parameters:
    - wine_type_df: a DataFrame containing data related to wine types.
    Returns:
    - Index of column names corresponding to numeric columns.
    """
    numeric_columns = wine_type_df.select_dtypes(include=[np.number])
    return numeric_columns.columns
def create_qq_plot(wine_type_df, folder):
    """
    Create a Q-Q plot for each feature of the dataset and display them on one canvas.
    Parameters:
    - wine_type_df: The DataFrame containing data related to wine types.
    - folder: Subfolder of 'figures/' in which to save the figure.
    """
    wine_type = get_wine_str(wine_type_df)
    # select numerical columns
    numeric_columns = check_numeric_columns(wine_type_df)
    fig, axs = plt.subplots(3, 4, figsize=(20, 15))
    axs = axs.flatten()
    # one Q-Q plot per feature (the 3x4 grid holds at most 12)
    for i, column in enumerate(numeric_columns[:12]):
        stats.probplot(wine_type_df[column], dist="norm", plot=axs[i])
        axs[i].set_title(column, fontsize=30)
        axs[i].set_xlabel('')
        axs[i].set_ylabel('')
    title = f"QQ Plots for {wine_type}"
    fig.suptitle(title, fontsize=40, weight='bold')
    plt.tight_layout(rect=[0, 0.03, 1, 0.95])
    plt.savefig(f'figures/{folder}/{title}.png', bbox_inches='tight')
    plt.show()
def create_plots(plot_function, saving_folder, wine_list=wine_lists):
    """
    Create plots using a specified plot function for each item in the list of datasets.
    Parameters:
    - plot_function (callable): The function used to generate plots.
    - saving_folder (str): The folder name to save the plots.
    - wine_list (list): The list of DataFrames for which plots will be created.
    """
    # create a folder for graphs if it doesn't exist
    if saving_folder is not None and not os.path.isdir(f'figures/{saving_folder}'):
        os.makedirs(f'figures/{saving_folder}')
    # generate plots for each item in the list
    for item in wine_list:
        plot_function(item, saving_folder)

def use_function(function, wine_list=wine_lists):
    """
    Apply a specified function to each DataFrame in the list.
    Parameters:
    - function (callable): The function to be applied.
    - wine_list (list): The list of DataFrames to which the function will be applied.
    """
    for item in wine_list:
        function(item)
create_plots(create_qq_plot, '1_QQ_plots', wine_lists)
Observations from QQ Plots of Red and White Wine Datasets:¶
Right Skewness:
- Red Wine: Residual Sugar and Alcohol show deviations in the lower quantiles.
- White Wine: Similar to Red Wine, Residual Sugar and Alcohol show deviations in the lower quantiles.
Left Skewness:
- Red Wine: Free Sulfur Dioxide, Chlorides, and Sulphates show deviations in the upper quantiles.
- White Wine: Volatile Acidity, Chlorides, and Sulphates show deviations in the upper quantiles.
Implications for Data Processing:
- The observed skewness in both datasets suggests the need for normalizing transformations. We will continue by calculating skewness coefficients.
- Techniques like logarithmic or Box-Cox transformations may help address these deviations and improve the homogeneity of the data.
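The two candidate transformations can be compared on synthetic data (a sketch; the toy sample is illustrative, not a wine feature): `np.log` with a small offset guards against zeros, while `scipy.stats.boxcox` fits its power parameter by maximum likelihood and requires strictly positive input.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# right-skewed toy sample (lognormal), standing in for a feature like chlorides
x = rng.lognormal(mean=0.0, sigma=0.8, size=1000)

# log transform with a small offset (the same trick used later in this notebook)
x_log = np.log(x + 0.001)

# Box-Cox transform; the optimal lambda is estimated from the data
x_bc, lmbda = stats.boxcox(x)

print(round(stats.skew(x), 2))      # strongly positive
print(round(stats.skew(x_log), 2))  # close to 0
print(round(stats.skew(x_bc), 2))   # close to 0
```

Both transforms pull the skewness coefficient toward zero; Box-Cox adapts its exponent per feature, while the log transform is simpler and easier to invert and interpret.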
def calculate_skewness_coefficient(wine_type_df):
    print(f"\nThe skewness coefficient of {get_wine_str(wine_type_df)}: \n")
    numerical_columns = wine_type_df.select_dtypes(include=['number']).columns
    for column in numerical_columns:
        skewness = round(wine_type_df[column].skew(), 2)
        print(f"{column}: {skewness}")

# use_function(calculate_skewness_coefficient, wine_lists)
Observations from the skewness coefficients¶
- White Wine:
  - Chlorides (5.02): high skewness.
  - Volatile Acidity (1.58), Citric Acid (1.28), Residual Sugar (1.08), Free Sulfur Dioxide (1.41): moderate skewness.
- Red Wine:
  - Residual Sugar (4.54) and Chlorides (5.68): high skewness.
  - Free Sulfur Dioxide (1.25), Total Sulfur Dioxide (1.52), Sulphates (2.43): moderate skewness.
We are going to apply a log transformation to the features with high and moderate skewness.
white_wine_log_columns = [
'chlorides',
'volatile acidity',
'citric acid',
'residual sugar',
'free sulfur dioxide'
]
red_wine_log_columns = [
'residual sugar',
'chlorides',
'free sulfur dioxide',
'total sulfur dioxide',
'sulphates'
]
log_red_wine_df = df_red_wine.copy()
log_white_wine_df = df_white_wine.copy()
log_red_wine_df[red_wine_log_columns] = np.log(log_red_wine_df[red_wine_log_columns] + 0.001)
log_white_wine_df[white_wine_log_columns] = np.log(log_white_wine_df[white_wine_log_columns] + 0.001)
log_red_wine_df.wine_type = "Red Wine(Log)"
log_white_wine_df.wine_type = "White Wine(Log)"
log_dfs = [log_red_wine_df, log_white_wine_df]
create_plots(create_qq_plot, "2_QQ_plots_log", log_dfs)
#use_function(calculate_skewness_coefficient, log_dfs)
Observations based on the first log transformation:¶
Red Wine (Original vs Log-Transformed):
- Original: Residual Sugar 4.54, Chlorides 5.68.
- Log-Transformed: Residual Sugar 1.81, Chlorides 1.79. Free Sulfur Dioxide changed from positive (1.25) to slightly negative skewness (-0.23).
White Wine (Original vs Log-Transformed):
- Original: Volatile Acidity 1.58, Citric Acid 1.28.
- Log-Transformed: Volatile Acidity 0.14, Citric Acid -5.56 (over-correction). Residual Sugar reduced from 1.08 to -0.16, Chlorides from 5.02 to 1.19.
Minimal Impact on Some Variables:
- Alcohol and quality in both Red and White wines showed minimal changes (around 0.86 and 0.22 respectively).
Avoid Log Transformation For:
- Red Wine: Free Sulfur Dioxide and Total Sulfur Dioxide.
- White Wine: Citric Acid and Residual Sugar.
white_wine_update_log_columns = [
'chlorides',
'volatile acidity'
]
red_wine_update_log_columns = [
'residual sugar',
'chlorides'
]
log_red_update_wine_df = df_red_wine.copy()
log_white_update_wine_df = df_white_wine.copy()
log_red_update_wine_df[red_wine_update_log_columns] = np.log(log_red_update_wine_df[red_wine_update_log_columns] + 0.001)
log_white_update_wine_df[white_wine_update_log_columns] = np.log(log_white_update_wine_df[white_wine_update_log_columns] + 0.001)
log_red_update_wine_df.wine_type = "Red Wine(Second Log)"
log_white_update_wine_df.wine_type = "White Wine(Second Log)"
log_update_dfs = [log_white_update_wine_df, log_red_update_wine_df]
create_plots(create_qq_plot, '3_QQ_plots_log_corrected', log_update_dfs)
#use_function(calculate_skewness_coefficient, log_update_dfs)
Observations from the second log transformation:¶
- White Wine:
  - Log transformation significantly reduced skewness in volatile acidity (from 1.58 to 0.14) and chlorides (from 5.02 to 1.19).
- Red Wine:
  - Effective reduction in skewness for residual sugar (from 4.54 to 1.81) and chlorides (from 5.68 to 1.79).
- Conclusion:
  - The second log transformation successfully reduced the high skewness of the key variables in both the Red and White Wine datasets.
3.3 Clean Data¶
3.3.1 Check for missing values¶
We are going to check if there are empty values.
def check_na(wine_type_df):
    print(f'{get_wine_str(wine_type_df)}')
    print(wine_type_df.info())
    print(wine_type_df.isnull().sum())

use_function(check_na, log_update_dfs)
Output (abridged):
- White Wine (Second Log): RangeIndex of 4898 entries, 12 columns (11 float64 features plus int64 quality), zero nulls in every column.
- Red Wine (Second Log): RangeIndex of 1599 entries, 12 columns (11 float64 features plus int64 quality), zero nulls in every column.
Observations from Checking Missing Values:¶
- There are no missing data in either dataset.
- All feature columns are of the float type, and the target column is an integer.
3.3.2 Outlier Analysis¶
We will create box plots to understand the outliers.
def create_box_plot(wine_type_df, folder):
    feature_columns = wine_type_df.drop(columns=['quality']).columns.tolist()
    fig, axs = plt.subplots(3, 4, figsize=(25, 18))
    # flatten the array of axes to make it easier to iterate over
    axs = axs.flatten()
    for i, column in enumerate(feature_columns):
        sns.boxplot(x='quality', y=column, data=wine_type_df, ax=axs[i], legend=False, color="grey")
        axs[i].set_title(column, fontsize=30, weight='bold')
        axs[i].set_xlabel('')
        axs[i].set_ylabel('')
    plt.xticks(rotation=45, fontsize=20)
    plt.yticks(fontsize=20)
    title = f'Box Plots for {get_wine_str(wine_type_df)}'
    fig.suptitle(title, fontsize=40, weight='bold')
    plt.savefig(f'figures/{folder}/{title}.png', bbox_inches='tight')
    plt.show()
#create_plots(create_box_plot, '1_Boxplots_wine_type', log_update_dfs)
Observations from Box Plots¶
- Based on these plots alone it is hard to decide whether we should delete the outliers.
- We are going to develop separate QQ plots for wines classified into different quality categories: bad (3-4), middle (5-6-7), and good (8-9) to better understand our outliers.
To understand distribution asymmetries, we use the skewness coefficient to identify which variables deviate from normality.
To pinpoint extreme outliers, we use a threshold of |Z-score| > 6 to focus on the most anomalous data points in the wine quality dataset.
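As a toy illustration of this thresholding (synthetic data, not the wine features): `scipy.stats.zscore` standardizes a column, and `|z| > 6` flags only the most extreme points.

```python
import numpy as np
from scipy.stats import zscore

# synthetic column: a tight cluster of 199 identical values plus one extreme point
data = np.array([0.0] * 199 + [100.0])

z = zscore(data)          # (x - mean) / std, computed over the whole column
flagged = np.abs(z) > 6   # the same cutoff used below

print(flagged.sum())      # only the injected extreme point exceeds the threshold
```

In the full dataset the same cutoff is applied per feature column, excluding `quality` and `quality_category`.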
def calculate_z_score(wine_type_df):
    z_scores_df = wine_type_df.copy()
    print(f"\nZ Score of {get_wine_str(wine_type_df)}:\n")
    for col in wine_type_df.columns:
        if col not in ('quality', 'quality_category'):
            z_scores_df[col + ' Z-score'] = zscore(wine_type_df[col])
            outliers = abs(z_scores_df[col + ' Z-score']) > 6
            outlier_count = outliers.sum()
            print(f"Outliers in {col}: {outlier_count}")
            if outlier_count > 0:
                print(f"Outlier data for {col}:\n{wine_type_df[outliers]}")
            print("\n")

#use_function(calculate_z_score, log_update_dfs)
Observations from Skewness Coefficients and Z-Score Analysis¶
White Wine:¶
- Poor Quality White Wine:
  - Notable skewness in chlorides (1.51) and free sulfur dioxide (1.85).
  - Outliers are mainly in free sulfur dioxide and total sulfur dioxide.
- Good Quality White Wine:
  - Higher skewness in citric acid (1.94) and density (1.53).
  - Outliers are present in several attributes, notably in citric acid (7 outliers) and density (1 outlier).
Red Wine:¶
- Poor Quality Red Wine:
  - Considerable skewness in chlorides (2.51) and sulphates (3.08).
  - Outliers are observed in sulphates (2) and alcohol (1).
- Good Quality Red Wine:
  - Skewness is pronounced in total sulfur dioxide (2.64) and sulphates (2.18).
  - Outliers are mainly in total sulfur dioxide (2) and sulphates (2).
def delete_outliers(wine_type_df):
    without_outliers_df = wine_type_df.copy()
    for col in without_outliers_df.columns:
        if col not in ('quality', 'quality_category'):
            without_outliers_df[col + '_Z_score'] = zscore(without_outliers_df[col])
    for col in without_outliers_df.columns:
        if '_Z_score' in col:
            without_outliers_df = without_outliers_df[np.abs(without_outliers_df[col]) <= 6]
    return without_outliers_df.drop(columns=[c for c in without_outliers_df if '_Z_score' in c])
df_white_without_outliers = delete_outliers(log_white_update_wine_df)
df_red_without_outliers = delete_outliers(log_red_update_wine_df)
df_white_without_outliers.wine_type = 'White Wine (Without Outliers)'
df_red_without_outliers.wine_type = 'Red Wine (Without Outliers)'
wine_without_outliers_dfs = [df_white_without_outliers, df_red_without_outliers]
# pd.cut with these edges labels quality in (2, 5] as 'poor' and (5, 10] as 'good'
bin_edges = [2, 5, 10]
bin_labels = ['poor', 'good']
red_wine_quality_df = df_red_without_outliers.copy()
white_wine_quality_df = df_white_without_outliers.copy()
red_wine_quality_df['quality_category'] = pd.cut(red_wine_quality_df['quality'], bins=bin_edges, labels=bin_labels)
white_wine_quality_df['quality_category'] = pd.cut(white_wine_quality_df['quality'], bins=bin_edges, labels=bin_labels)
df_red_poor_without_outliers = red_wine_quality_df[red_wine_quality_df['quality_category'] == 'poor']
df_red_good_without_outliers = red_wine_quality_df[red_wine_quality_df['quality_category'] == 'good']
df_white_poor_without_outliers = white_wine_quality_df[white_wine_quality_df['quality_category'] == 'poor']
df_white_good_without_outliers = white_wine_quality_df[white_wine_quality_df['quality_category'] == 'good']
df_white_poor_without_outliers.wine_type = 'White Wine Poor (Without Outliers)'
df_white_good_without_outliers.wine_type = 'White Wine Good (Without Outliers)'
df_red_poor_without_outliers.wine_type = 'Red Wine Poor (Without Outliers)'
df_red_good_without_outliers.wine_type = 'Red Wine Good (Without Outliers)'
wine_quality_without_outliers_dfs = [
df_white_poor_without_outliers,
df_white_good_without_outliers,
df_red_poor_without_outliers,
df_red_good_without_outliers
]
create_plots(create_box_plot, '2_Boxplots_without_outliers', wine_without_outliers_dfs)
create_plots(create_box_plot, '3_Boxplots_quality_without_outliers', wine_quality_without_outliers_dfs)
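A minimal sketch of how `pd.cut` assigns the labels above (intervals are right-closed by default, so the edges `[2, 5, 10]` produce the bins (2, 5] and (5, 10]):

```python
import pandas as pd

quality = pd.Series([3, 4, 5, 6, 7, 8, 9])
category = pd.cut(quality, bins=[2, 5, 10], labels=['poor', 'good'])

# qualities 3-5 fall in (2, 5] -> 'poor'; 6-9 fall in (5, 10] -> 'good'
print(dict(zip(quality, category)))
```

Note that this two-way split groups quality 5 with the 'poor' wines, which differs slightly from the three-way bad/middle/good categories described earlier.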
4.1 Visualize Final Data¶
- Histograms of the quality distribution for each wine type
# getting quality counts for red and white wines and sort them by quality
red_quality_counts = df_red_wine['quality'].value_counts().sort_index()
white_quality_counts = df_white_wine['quality'].value_counts().sort_index()
# normalize the counts to get densities
red_density = red_quality_counts / red_quality_counts.sum()
white_density = white_quality_counts / white_quality_counts.sum()
# convert to DataFrames; reset_index yields 'quality' and 'count' columns (pandas >= 2.0)
red_df = red_density.reset_index()
white_df = white_density.reset_index()
merged_df = pd.merge(red_df, white_df, on='quality', how='outer').fillna(0)
bar_plot_df = merged_df.rename(columns={'count_x': 'Red Wine Density', 'count_y': 'White Wine Density'})
# reshape to long format so seaborn can plot both wine types side by side
melted_df = pd.melt(bar_plot_df, id_vars='quality', var_name='Wine Type', value_name='Density')
# plotting
plt.figure(figsize=(12, 8))
sns.barplot(x='quality', y='Density', hue='Wine Type', data=melted_df, palette='viridis')
# description
plt.title('Density Distribution of Red and White Wines by Quality', fontsize=30, weight ='bold')
plt.xlabel('Quality', fontsize=25)
plt.ylabel('Density', fontsize=25)
plt.xticks(rotation=45, fontsize=20)
plt.yticks(fontsize=20)
plt.legend(fontsize=20)
# save fig
plt.savefig('figures/Density Distribution of Red and White Wines by Quality.png', bbox_inches = 'tight')
plt.show()
- Raw Clustermap of our cleaned data
# add a numeric color label to each dataset (1 = red, 2 = white)
df_red_wine['color'] = 1
df_white_wine['color'] = 2
df_wine = pd.concat([df_red_wine, df_white_wine], axis=0).reset_index(drop=True)
def create_cluster_by_rows(dataset):
    '''
    Displays a clustermap plot with row labels for wine type.
    Parameters:
    - dataset (pandas.DataFrame): The DataFrame containing data related to red and white wines.
    '''
    # select numerical columns from the dataset
    data = dataset.select_dtypes(include=['number'])
    # extract and remove the 'color' column from the data
    wine_color = data.pop("color")
    # define color mapping for wine type labels
    colors_of_labels = {1: "#b11226", 2: "#EEEDC4"}
    # map wine types to colors
    labels = wine_color.map(colors_of_labels)
    df_with_labels = pd.DataFrame(data={'wine type': labels})
    # create the clustermap on the features (excluding the 'quality' target)
    g = sns.clustermap(data.drop(columns='quality'), cmap='plasma', row_colors=df_with_labels,
                       annot_kws={'size': 30}, standard_scale=1, figsize=(15, 20))
    # hide row and column dendrograms
    g.ax_row_dendrogram.set_visible(False)
    g.ax_col_dendrogram.set_visible(False)
    # describe
    title = 'Composite Clustermap White and Red Wine'
    plt.suptitle(title, fontsize=40, weight='bold', y=0.85, x=0.60)
    plt.setp(g.ax_heatmap.get_xticklabels(), fontsize=30, rotation=45, ha='right')
    g.ax_cbar.set_position((1, .2, .03, .4))
    plt.savefig(f'figures/{title}.png', bbox_inches='tight')
    plt.show()

create_cluster_by_rows(df_wine)
5. Model creation¶
Firstly, for the model we have to choose which features we want to use.
-> Create a clustermap and heatmap to understand the relationship between the features and the wine quality score.
Comments about clustermap:¶
- Our model's performance is influenced by the considerable multicollinearity observed in the data. Identifying features with minimal correlation is essential.
- To achieve this, we compare different feature lists, selecting one from closely correlated pairs for model evaluation. Each model is restricted to a maximum of 8 features. This approach aims to uncover independent and meaningful predictors for the development of an optimal model.
def create_corr_matrix(wine_type_df):
    """
    Create and return the correlation matrix for the numeric columns in the given DataFrame.
    Parameters:
    - wine_type_df (pandas.DataFrame): DataFrame to compute the matrix on.
    Returns:
    The correlation matrix for the numeric columns.
    """
    numeric_columns = check_numeric_columns(wine_type_df)
    return wine_type_df[numeric_columns].corr()
def create_clustermap(wine_type_df, folder):
    """
    Create and display a clustermap of the correlation matrix for the DataFrame.
    Parameters:
    - wine_type_df (pandas.DataFrame): The DataFrame for which to display the clustermap.
    - folder: Subfolder of 'figures/' in which to save the figure.
    """
    # create the correlation matrix
    correlation_matrix = create_corr_matrix(wine_type_df)
    # calculate p-values for each correlation coefficient
    pvalues = correlation_matrix.apply(lambda x:
        correlation_matrix.columns.map(lambda y:
            pearsonr(wine_type_df[x.name], wine_type_df[y])[1]))
    # create the mask for non-significant correlations (p-value > 0.05)
    mask_significant = pvalues > 0.05
    # create and display the clustermap
    g = sns.clustermap(correlation_matrix,
                       mask=mask_significant, annot=True,
                       cmap='seismic',
                       row_cluster=True,
                       col_cluster=True,
                       linewidths=.5, vmax=1, vmin=-1, annot_kws={'size': 10},
                       figsize=(12, 12))
    # hide column dendrogram
    g.ax_col_dendrogram.set_visible(False)
    # description
    title = f'Correlation Matrix Clustermap for {get_wine_str(wine_type_df)}'
    g.fig.suptitle(title, fontsize=20, weight='bold', y=0.85)
    plt.setp(g.ax_heatmap.get_xticklabels(), fontsize=12, rotation=45, ha='right')
    plt.setp(g.ax_heatmap.get_yticklabels(), fontsize=12)
    g.ax_cbar.set_position((1, .2, .03, .4))
    # save
    plt.savefig(f'figures/{folder}/{title}.png', bbox_inches='tight')
    plt.show()
create_plots(create_clustermap, '1_Clustermaps_wine_type', wine_without_outliers_dfs)
create_plots(create_clustermap, '2_Clustermaps_quality_without_outliers', wine_quality_without_outliers_dfs)
5.1 Creating features lists for models to find the best combination¶
def create_corr_with_target(wine_type_df, target='quality'):
    correlation_matrix = create_corr_matrix(wine_type_df)
    correlation_with_target = correlation_matrix[target].sort_values(key=abs, ascending=False)
    return correlation_with_target

def return_top_features_with_target(wine_type_df, n=8):
    '''
    Return the top n features most correlated with the target column.
    '''
    correlation_with_target = create_corr_with_target(wine_type_df)
    # skip index 0, which is the target's correlation with itself
    top_features = correlation_with_target[1:n + 1].index.tolist()
    return top_features

def return_features_among_highly_associated(wine_type_df, threshold=0.6, n=8):
    '''
    Greedily select features by correlation with the target; any feature whose
    correlation with an already selected feature exceeds the threshold is skipped.
    '''
    correlation_matrix = create_corr_matrix(wine_type_df)
    correlation_with_target = create_corr_with_target(wine_type_df)
    selected_features = [correlation_with_target.index[1]]
    for feature in correlation_with_target.index[2:]:
        if all(abs(correlation_matrix[feature][selected_feature]) < threshold
               for selected_feature in selected_features):
            selected_features.append(feature)
        if len(selected_features) == n:
            break
    return selected_features
top_features_dict = {}
features_among_highly_associated_dict = {}

def create_features_dict(df_lists):
    for wine_df in df_lists:
        df_name = get_wine_str(wine_df)
        try:
            top_features_dict[df_name] = return_top_features_with_target(wine_df)
            features_among_highly_associated_dict[df_name] = return_features_among_highly_associated(wine_df)
        except Exception as e:
            print(f"Error processing {df_name}: {e}")

create_features_dict(wine_quality_without_outliers_dfs)
create_features_dict(wine_without_outliers_dfs)
def evaluate_ols_model(X, y, test_size=0.2, random_state=42):
    """
    Create an OLS regression model and evaluate it.
    Parameters:
    - X (pandas.DataFrame): Independent variables.
    - y (pandas.Series): Dependent variable.
    - test_size (float): Size of the test set in the train-test split (default: 0.2).
    - random_state (int): Random seed for reproducibility (default: 42).
    Returns:
    OLS model results, R-squared score, and Mean Squared Error (MSE) on the test set.
    """
    # train-test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size, random_state=random_state)
    # add a constant term to the independent variables for the OLS model
    X_train = sm.add_constant(X_train)
    X_test = sm.add_constant(X_test)
    # fit & predict
    model = sm.OLS(y_train, X_train)
    model_results = model.fit()
    y_pred = model_results.predict(X_test)
    # evaluate
    test_r2 = r2_score(y_test, y_pred)
    test_mse = mean_squared_error(y_test, y_pred)
    return model_results, test_r2, test_mse
results_df = pd.DataFrame()

def find_best_model(df, features=[top_features_dict, features_among_highly_associated_dict]):
    """
    Find the best model among different feature sets for a given dataset.
    Parameters:
    - df (pandas.DataFrame): Dataset.
    - features (list): List of feature-set dictionaries to evaluate.
    Returns:
    - pandas.Series: Information about the best OLS model: model name, model, feature list, test r2, test MSE.
    """
    dataset_name = get_wine_str(df)
    # start below any attainable score so the first model is always kept
    best_r2 = -np.inf
    for feature_set in features:
        # name the model
        model_name = f"Model_{dataset_name}"
        # create X and y sets
        X = df[feature_set[dataset_name]]
        y = df['quality']
        # evaluate the OLS model
        model, r2, mse = evaluate_ols_model(X, y)
        # keep the model with the highest test R-squared
        if best_r2 < r2:
            best_model, best_features, best_r2, best_mse = model, list(X.columns), r2, mse
    result_model = pd.Series({
        'Model name': model_name,
        'model': best_model,
        'Features': best_features,
        'Test r2': best_r2,
        'Test MSE': best_mse
    })
    return result_model
Now we are testing all different combinations of features on datasets.
for dataset in wine_quality_without_outliers_dfs:
    dataset_name = get_wine_str(dataset)
    model = find_best_model(dataset)
    results_df = pd.concat([results_df, model], axis=1)
5.2 The list of the best models with their scores:¶
results_df = results_df.transpose()
results_df
5.2.1 Comparison of poor wines:¶
def describe_the_model(number_of_the_model):
    '''
    Print a description of the model.
    Parameters:
    - number_of_the_model (int): row number of the model in results_df.
    '''
    model = results_df.iloc[number_of_the_model]
    print('-------------------------------------------------------')
    print(f'The summary for the {model["Model name"]}:\n\n{model["model"].summary()}.\n')
    print('-------------------------------------------------------')
    print(f'R-squared on Test Set: {model["Test r2"]}')
    print(f'Mean Squared Error on Test Set: {model["Test MSE"]}\n')
describe_the_model(0)
describe_the_model(2)
-------------------------------------------------------
The summary for the Model_Red Wine Poor (Without Outliers):
OLS Regression Results
==============================================================================
Dep. Variable: quality R-squared: 0.099
Model: OLS Adj. R-squared: 0.087
Method: Least Squares F-statistic: 8.017
Date: Tue, 21 Nov 2023 Prob (F-statistic): 2.76e-10
Time: 17:55:09 Log-Likelihood: -180.16
No. Observations: 592 AIC: 378.3
Df Residuals: 583 BIC: 417.8
Df Model: 8
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const -5.4040 12.471 -0.433 0.665 -29.897 19.089
volatile acidity -0.5369 0.090 -5.985 0.000 -0.713 -0.361
total sulfur dioxide 0.0017 0.000 4.119 0.000 0.001 0.003
pH -0.2327 0.111 -2.093 0.037 -0.451 -0.014
citric acid -0.2948 0.116 -2.538 0.011 -0.523 -0.067
alcohol 0.0139 0.022 0.628 0.530 -0.030 0.058
sulphates 0.0283 0.095 0.299 0.765 -0.158 0.215
density 11.2753 12.464 0.905 0.366 -13.204 35.754
residual sugar -0.0408 0.048 -0.855 0.393 -0.134 0.053
==============================================================================
Omnibus: 407.201 Durbin-Watson: 2.014
Prob(Omnibus): 0.000 Jarque-Bera (JB): 3857.604
Skew: -3.088 Prob(JB): 0.00
Kurtosis: 13.874 Cond. No. 8.51e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.51e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
-------------------------------------------------------
R-squared on Test Set: 0.13356530134716293
Mean Squared Error on Test Set: 0.08250272298329342
Discussion:¶
White Poor Wine (N=1639)¶
R-squared on the test = 0.070:
- The R-squared value is 0.070, indicating that only 7% of the variability in the dependent variable (quality) is explained by the model. The adjusted R-squared of 0.060, which accounts for the number of predictor variables, suggests that some predictors could be dropped because they are not improving the model.
F-Statistic = 11.69:
- The model is significant.
Features Significantly Constituting the Model:
- free sulfur dioxide: A one-unit increase in free sulfur dioxide is associated with a 0.0020 increase in quality.
- volatile acidity: A one-unit increase in volatile acidity is associated with a decrease of 0.1660 in quality.
- residual sugar: A one-unit increase in residual sugar is associated with a 0.0043 increase in quality.
Red Poor Wine (N=741)¶
R-squared on the test = 0.133:
- The training R-squared is 0.099, indicating that approximately 9.9% of the variability in the dependent variable (quality) is explained by the model; on the test set it rises to 0.133. The adjusted R-squared of 0.087, which accounts for the number of predictor variables, is slightly lower than the R-squared, suggesting that not all added variables contribute meaningfully to the model.
F-Statistic = 8.017:
- The F-statistic tests the overall significance of the model; with p ≈ 2.8e-10, the model is statistically significant.
Features Significantly Constituting the Model:
- volatile acidity: A one-unit increase in volatile acidity is associated with a decrease of approximately 0.537 in wine quality.
- total sulfur dioxide: A one-unit increase in total sulfur dioxide is associated with an increase of 0.0017 in wine quality.
- pH: A one-unit increase in pH is associated with a decrease of approximately 0.233 in wine quality.
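The adjusted R² values quoted in these summaries can be reproduced directly from the R², the sample size, and the number of predictors; a minimal sketch (the numbers are taken from the Red Wine Poor OLS summary above):

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared for n observations and p predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Red Wine Poor model: R-squared = 0.099, 592 observations, 8 predictors
print(round(adjusted_r2(0.099, 592, 8), 3))  # ≈ 0.087, matching the OLS summary
```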
Summary:¶
Based on our models, we can observe that volatile acidity and the level of free/total sulfur dioxide are important features in predicting poor wine. Since the models do not perform particularly well, and the adjusted R² suggests we could reduce the number of features without losing much information, a smaller model is worth trying. It will also be interesting to observe how combinations of multicollinear features, such as total sulfur dioxide and free sulfur dioxide, influence the model.
5.2.2 Comparison of good wine:¶
describe_the_model(1)
-------------------------------------------------------
The summary for the Model_White Wine Good (Without Outliers):
OLS Regression Results
==============================================================================
Dep. Variable: quality R-squared: 0.116
Model: OLS Adj. R-squared: 0.113
Method: Least Squares F-statistic: 42.62
Date: Tue, 21 Nov 2023 Prob (F-statistic): 1.61e-64
Time: 17:43:28 Log-Likelihood: -2161.5
No. Observations: 2603 AIC: 4341.
Df Residuals: 2594 BIC: 4394.
Df Model: 8
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 115.9796 22.978 5.048 0.000 70.923 161.036
alcohol 0.0188 0.028 0.662 0.508 -0.037 0.074
density -113.8570 23.251 -4.897 0.000 -159.449 -68.265
chlorides -0.0868 0.044 -1.995 0.046 -0.172 -0.002
total sulfur dioxide 0.0004 0.000 1.076 0.282 -0.000 0.001
residual sugar 0.0499 0.009 5.636 0.000 0.033 0.067
pH 0.6328 0.113 5.614 0.000 0.412 0.854
fixed acidity 0.0952 0.024 3.979 0.000 0.048 0.142
citric acid 0.0427 0.110 0.388 0.698 -0.174 0.259
==============================================================================
Omnibus: 451.296 Durbin-Watson: 1.930
Prob(Omnibus): 0.000 Jarque-Bera (JB): 724.773
Skew: 1.182 Prob(JB): 4.15e-158
Kurtosis: 4.047 Cond. No. 4.18e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.18e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
-------------------------------------------------------
R-squared on Test Set: 0.1193013699182327
Mean Squared Error on Test Set: 0.3260904708463896
describe_the_model(3)
-------------------------------------------------------
The summary for the Model_Red Wine Good (Without Outliers):
OLS Regression Results
==============================================================================
Dep. Variable: quality R-squared: 0.196
Model: OLS Adj. R-squared: 0.187
Method: Least Squares F-statistic: 20.48
Date: Tue, 21 Nov 2023 Prob (F-statistic): 7.36e-28
Time: 17:55:23 Log-Likelihood: -419.30
No. Observations: 680 AIC: 856.6
Df Residuals: 671 BIC: 897.3
Df Model: 8
Covariance Type: nonrobust
========================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------------
const 10.0697 13.171 0.765 0.445 -15.791 35.930
alcohol 0.1289 0.022 5.962 0.000 0.086 0.171
volatile acidity -0.1880 0.146 -1.292 0.197 -0.474 0.098
sulphates 0.6421 0.137 4.682 0.000 0.373 0.911
citric acid 0.1350 0.149 0.904 0.366 -0.158 0.428
chlorides -0.1498 0.063 -2.375 0.018 -0.274 -0.026
total sulfur dioxide -0.0018 0.001 -2.400 0.017 -0.003 -0.000
density -5.0362 13.081 -0.385 0.700 -30.722 20.649
pH -0.2665 0.150 -1.772 0.077 -0.562 0.029
==============================================================================
Omnibus: 113.755 Durbin-Watson: 1.973
Prob(Omnibus): 0.000 Jarque-Bera (JB): 174.803
Skew: 1.108 Prob(JB): 1.10e-38
Kurtosis: 4.121 Cond. No. 4.95e+04
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.95e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
-------------------------------------------------------
R-squared on Test Set: 0.28466886558862003
Mean Squared Error on Test Set: 0.147073314184919
Discussion:¶
White Good Wine (N=3249)¶
R-squared on the test = 0.119:
- The R-squared value on the test is 0.119, indicating that approximately 11.9% of the variability in the dependent variable (quality) is explained by the model.
F-Statistic = 42.62:
- The F-statistic tests the overall significance of the model. A larger F-statistic with a small p-value suggests that at least one predictor variable is significantly related to the dependent variable.
- The model is statistically significant.
Features Significantly Constituting the Model:
- density: A one-unit increase in density is associated with a quality decrease of approximately 113.8570 units.
- chlorides: A one-unit increase in chloride content is associated with a quality decrease of approximately 0.0868 units.
- residual sugar: A one-unit increase in residual sugar is associated with a quality increase of approximately 0.0499 units.
- pH: A one-unit increase in pH is associated with a quality increase of approximately 0.6328 units.
- fixed acidity: A one-unit increase in fixed acidity is associated with a quality increase of approximately 0.0952 units.
Red Good Wine (N=851)¶
R-squared on the test = 0.285:
- The R-squared value on the test is 0.285, indicating that approximately 28.5% of the variability in the dependent variable (quality) is explained by the model. This is much higher than the R² on the training data (0.196), which may be due to random fluctuation in the relatively small test set.
F-Statistic = 20.48:
- The model is statistically significant.
Features Significantly Constituting the Model:
- alcohol: A one-unit increase in alcohol is associated with a quality increase of approximately 0.1289 units.
- sulphates: A one-unit increase in sulphate content is associated with a quality increase of approximately 0.6421 units.
- chlorides: A one-unit increase in chloride content is associated with a quality decrease of approximately 0.1498 units.
- total sulfur dioxide: A one-unit increase in total sulfur dioxide is associated with a quality decrease of approximately 0.0018 units.
Summary:¶
Based on our models, we can say that the chloride level can be an important predictor of good wine quality. Interestingly, the large coefficient and standard error for density may suggest that there are two clusters of points with different density levels. In the future it would be interesting to split the data into smaller groups and look for subclusters; they might be characterised by different quality levels. A follow-up analysis could test how models trained on separate density clusters perform.
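The density-cluster idea above could be explored with a simple clustering pass; a hypothetical sketch (KMeans and the helper name are our choices, not part of the analysis above, and the column names assume the wine schema):

```python
import pandas as pd
from sklearn.cluster import KMeans

def quality_by_density_cluster(df, n_clusters=2, random_state=42):
    """Cluster wines on density alone and compare mean quality per cluster."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    labels = km.fit_predict(df[["density"]])
    return (
        df.assign(density_cluster=labels)
          .groupby("density_cluster")["quality"]
          .agg(["mean", "count"])
    )

# e.g. quality_by_density_cluster(df_white_good_without_outliers)
```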
Recipe¶
def identyfy_three_important_features(evaluation_matrix_list):
    result_list = []
    # Iterate through the evaluation matrices (one per model)
    for dataframe in evaluation_matrix_list:
        # Skip the intercept row and work on a copy
        df_subset = dataframe[0].iloc[1:].copy()
        df_subset['Coefficients'] = df_subset['Coefficients'].abs()
        # Sort by absolute coefficient size, largest first
        sorted_subset = df_subset.sort_values(by='Coefficients', ascending=False).reset_index(drop=True)
        # Select the three features with the largest absolute coefficients
        top_three_features = sorted_subset['Parameters'].iloc[:3].tolist()
        # Map the pipeline name to its top three features
        result_list.append({dataframe[0]['df'][1]: top_three_features})
    return result_list
Finding the recipe for the good and poor wine¶
top_three_features_list = identyfy_three_important_features(evaluation_matrix_list=evaluation_matrix_list)
for df in wine_quality_without_outliers_dfs:
print(get_wine_str(df))
White Wine Poor (Without Outliers) White Wine Good (Without Outliers) Red Wine Poor (Without Outliers) Red Wine Good (Without Outliers)
features_to_reverse_white = ['chlorides', 'volatile acidity']
features_to_reverse_red = ['residual sugar', 'chlorides']
def correct_reverse_log_transformation(df, features):
    # Work on a copy to avoid SettingWithCopyWarning on DataFrame slices
    df = df.copy()
    for feature in features:
        if feature in df.columns:
            # Undo the earlier log transform
            df[feature] = np.exp(df[feature])
    return df
df_white_good_without_outliers = correct_reverse_log_transformation(df_white_good_without_outliers, features_to_reverse_white)
df_white_poor_without_outliers = correct_reverse_log_transformation(df_white_poor_without_outliers, features_to_reverse_white)
df_red_good_without_outliers = correct_reverse_log_transformation(df_red_good_without_outliers, features_to_reverse_red)
df_red_poor_without_outliers = correct_reverse_log_transformation(df_red_poor_without_outliers, features_to_reverse_red)
df_white_good_without_outliers
| fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | quality_category | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.271 | 0.36 | 20.7 | 0.046 | 45.0 | 170.0 | 1.00100 | 3.00 | 0.45 | 8.8 | 6 | good |
| 1 | 6.3 | 0.301 | 0.34 | 1.6 | 0.050 | 14.0 | 132.0 | 0.99400 | 3.30 | 0.49 | 9.5 | 6 | good |
| 2 | 8.1 | 0.281 | 0.40 | 6.9 | 0.051 | 30.0 | 97.0 | 0.99510 | 3.26 | 0.44 | 10.1 | 6 | good |
| 3 | 7.2 | 0.231 | 0.32 | 8.5 | 0.059 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.9 | 6 | good |
| 4 | 7.2 | 0.231 | 0.32 | 8.5 | 0.059 | 47.0 | 186.0 | 0.99560 | 3.19 | 0.40 | 9.9 | 6 | good |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4891 | 5.7 | 0.211 | 0.32 | 0.9 | 0.039 | 38.0 | 121.0 | 0.99074 | 3.24 | 0.46 | 10.6 | 6 | good |
| 4893 | 6.2 | 0.211 | 0.29 | 1.6 | 0.040 | 24.0 | 92.0 | 0.99114 | 3.27 | 0.50 | 11.2 | 6 | good |
| 4895 | 6.5 | 0.241 | 0.19 | 1.2 | 0.042 | 30.0 | 111.0 | 0.99254 | 2.99 | 0.46 | 9.4 | 6 | good |
| 4896 | 5.5 | 0.291 | 0.30 | 1.1 | 0.023 | 20.0 | 110.0 | 0.98869 | 3.34 | 0.38 | 12.8 | 7 | good |
| 4897 | 6.0 | 0.211 | 0.38 | 0.8 | 0.021 | 22.0 | 98.0 | 0.98941 | 3.26 | 0.32 | 11.8 | 6 | good |
3254 rows × 13 columns
def get_receipes(top_three_features_list, df_lists):
for df in df_lists:
pipeline_name = f"Pipeline_{get_wine_str(df)}"
for model in top_three_features_list:
if pipeline_name in model:
feature_list = model[pipeline_name]
numeric_columns = check_numeric_columns(df)
numeric_df = df[numeric_columns]
means = numeric_df[feature_list].mean()
stds = numeric_df[feature_list].std()
ranges = {feature: (means[feature] - stds[feature], means[feature] + stds[feature]) for feature in feature_list}
print(f"\nRecipes of {get_wine_str(df)} wine:")
for feature, range_vals in ranges.items():
print(f"{feature.capitalize()}: {range_vals[0]:.2f} - {range_vals[1]:.2f}")
get_receipes(top_three_features_list, wine_quality_without_outliers_dfs)
Recipes of White Wine Poor (Without Outliers) wine:
Volatile acidity: 0.20 - 0.42
Free sulfur dioxide: 16.18 - 53.87
Fixed acidity: 6.08 - 7.85

Recipes of White Wine Good (Without Outliers) wine:
Alcohol: 9.60 - 12.09
Residual sugar: 1.22 - 10.86
Volatile acidity: 0.17 - 0.35

Recipes of Red Wine Poor (Without Outliers) wine:
Volatile acidity: 0.41 - 0.77
Total sulfur dioxide: 17.86 - 91.21
Citric acid: 0.06 - 0.42

Recipes of Red Wine Good (Without Outliers) wine:
Alcohol: 9.75 - 11.96
Sulphates: 0.55 - 0.83
Chlorides: 0.05 - 0.12
import matplotlib.pyplot as plt
import numpy as np
from sklearn.ensemble import RandomForestClassifier
# Function to train Random Forest and get feature importances
def get_feature_importances(dataframe):
# Check and select only numeric columns, excluding 'quality'
numeric_columns = dataframe.select_dtypes(include=np.number).columns.tolist()
if 'quality' in numeric_columns:
numeric_columns.remove('quality')
X = dataframe[numeric_columns]
y = dataframe['quality']
rf = RandomForestClassifier(random_state=42)
rf.fit(X, y)
return rf.feature_importances_, numeric_columns
# Get feature importances and numeric columns
good_red_importances, good_red_columns = get_feature_importances(df_red_good_without_outliers)
poor_red_importances, poor_red_columns = get_feature_importances(df_red_poor_without_outliers)
good_white_importances, good_white_columns = get_feature_importances(df_white_good_without_outliers)
poor_white_importances, poor_white_columns = get_feature_importances(df_white_poor_without_outliers)
# Plotting
fig, axes = plt.subplots(1, 4, figsize=(20, 5), sharey=True)
axes[0].bar(good_red_columns, good_red_importances)
axes[1].bar(poor_red_columns, poor_red_importances)
axes[2].bar(good_white_columns, good_white_importances)
axes[3].bar(poor_white_columns, poor_white_importances)
# Add titles and sample sizes as annotations
for i, df, columns in zip(range(4),
[df_red_good_without_outliers, df_red_poor_without_outliers, df_white_good_without_outliers, df_white_poor_without_outliers],
[good_red_columns, poor_red_columns, good_white_columns, poor_white_columns]):
axes[i].set_title(f"{'Good' if i%2 == 0 else 'Poor'} {'Red' if i < 2 else 'White'} Wine")
axes[i].set_xticks(range(len(columns)))
axes[i].set_xticklabels(columns, rotation=45, ha="right")
axes[i].annotate(f"n = {len(df)}", xy=(0.5, 1), xycoords='axes fraction', ha='center')
plt.tight_layout()
plt.show()
Significant Statement¶
Conclusion & Discussions¶
Additional - Gridsearch Model creation¶
First, we decided to build a grid search with cross-validation to find the best model for each of our datasets, but after realising that scikit-learn does not provide a statistical analysis of the fitted model, we recreated the linear model using the statsmodels library.
Function: evaluate_GS_model¶
Parameters:
- pipe - the pipeline through which we pass the data
- X_train, y_train, X_test, y_test - the train and test splits
- regression_params - the hyperparameter grid for the evaluated model
- pipeline_name - the name of the pipeline
- cv - the cross-validation strategy (default KFold; here StratifiedKFold with 5 folds)

We define a GridSearchCV refitting on negative mean squared error, then:
- Fit the model.
- Predict on the test data.
- Extract the mean cross-validation score of the best model.
- Select the model with the best parameters.
- Save the results to a dataframe.
- Build the evaluation matrix (since scikit-learn does not provide a statistical summary of the model, we recreate it from the fitted intercept and coefficients).
def evaluate_GS_model(
pipe,
X_train,
y_train,
X_test,
y_test,
regression_params,
pipeline_name,
cv=KFold(n_splits=5),
predict_test=True,
predict_train=True,
):
"""
Evaluate a machine learning model using grid search with cross-validation.
Parameters:
- pipe (sklearn.pipeline.Pipeline): The machine learning pipeline to evaluate.
- X_train, y_train, X_test, y_test (pandas.DataFrame): Training and testing datasets.
- regression_params (dict): Parameter grid for grid search.
- pipeline_name (str): Name of the machine learning pipeline.
- cv (sklearn.model_selection._split): Cross-validation strategy (default: KFold with 5 splits).
- predict_test (bool): Whether to predict and evaluate on the test set.
- predict_train (bool): Whether to predict and evaluate on the training set.
Returns:
- tuple: Results DataFrame and Evaluation Matrix DataFrame.
"""
# define grid search
grid_search_model = GridSearchCV(
pipe,
regression_params,
cv=cv,
scoring=["r2", "neg_mean_squared_error", "neg_median_absolute_error"],
refit="neg_mean_squared_error",
return_train_score=True,
verbose=3,
)
# fit model
grid_search_model.fit(X_train, y_train)
y_test_pred = grid_search_model.predict(X_test) if predict_test else None
test_score = metrics.mean_squared_error(y_test, y_test_pred) if predict_test else None
y_train_pred = grid_search_model.predict(X_train) if predict_train else None
train_score = metrics.mean_squared_error(y_train, y_train_pred) if predict_train else None
mean_cv_score = grid_search_model.best_score_
cv_results_df = pd.DataFrame(grid_search_model.cv_results_).iloc[[grid_search_model.best_index_]]
cv_splits_scores_df = cv_results_df.filter(regex=r"split\d*_test_mean_squared_error").reset_index(drop=True)
# getting scores based on the metrics defined in the score
metrics_results_df = cv_results_df.filter(regex=r"mean_test_*").reset_index(drop=True)
best_estimator = grid_search_model.best_estimator_.named_steps[pipe.steps[-1][0]]
coefficients = best_estimator.coef_
intercept = best_estimator.intercept_
# save results in DataFrame
this_result = pd.concat(
[pd.DataFrame({
"pipeline_name": [pipeline_name],
"features": [list(X_test.columns)],
'coef': [coefficients],
"train score": [train_score],
"mean_cv_score": [mean_cv_score],
"test_score": [test_score],
"best_model": [grid_search_model.best_estimator_],
"parameters": [grid_search_model.best_params_],
}),
cv_splits_scores_df,
metrics_results_df,
],
axis=1
)
# create DataFrame for the evaluation matrix
parameters = np.append(intercept, coefficients)
dataset_column_names = ['Intercept']
dataset_column_names.extend(list(X_test.columns))
evaluation_matrix_df = pd.DataFrame({
'df': [pipeline_name for number in range(len(parameters))],
'const': [pipeline_name for number in range(len(parameters))],
"Parameters": pd.Series(dataset_column_names),
"Coefficients": pd.Series(parameters),
})
return this_result, evaluation_matrix_df
Define Linear Models using pipelines¶
- For y we take the labels representing the quality of the wine from 1 to 10 (1 being the lowest score).
- For X we select the 8 physicochemical features.
- For each of the three pairs of highly correlated features, we kept the former, the latter, or the mean of both.
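The "mean of both" option for a correlated pair can be implemented as a single derived column; a minimal sketch (the helper name and the example column names are ours, for illustration):

```python
import pandas as pd

def merge_correlated_pair(df, col_a, col_b, new_name):
    """Replace two highly correlated columns by their row-wise mean."""
    df = df.copy()
    df[new_name] = df[[col_a, col_b]].mean(axis=1)
    return df.drop(columns=[col_a, col_b])

# e.g. merge_correlated_pair(df_white_wine, "free sulfur dioxide",
#                            "total sulfur dioxide", "mean sulfur dioxide")
```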
#params for our gridsearch
base_steps = [('scaler', StandardScaler())] #normalizing values
lr = ("lr", LinearRegression())
lr_params = dict(lr__fit_intercept=[True, False])
estimators = [(lr, lr_params)]
def building_pipelines(df_lists, features=[top_features_dict, features_among_highly_associated_dict], test_size=0.2, random_state=42):
"""
Create pipelines and evaluate models for each dataset and feature set.
Parameters:
- df_lists (list): List of datasets.
- features (list): List of feature sets to use in the pipelines.
- test_size (float): Size of the test set in the train-test split (default: 0.2).
- random_state (int): Random seed for reproducibility (default: 42).
Returns:
- tuple: Pipeline, DataFrame with scores, and a list of evaluation matrices.
"""
results_df = pd.DataFrame()
evaluation_matrix_list = []
# iterate through each dataset
for dataset in df_lists:
dataset_name = get_wine_str(dataset)
# iterate through each feature list
for feature_set in features:
# name the pipeline
pipeline_name = f"Pipeline_{dataset_name}"
# create X and y sets
y = dataset['quality']
X = dataset[feature_set[dataset_name]]
# train-test split
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=test_size,
random_state=random_state
)
# GridSearch
for (estimator, params) in estimators:
print(f"Evaluating {estimator}\n")
# create a pipeline from the base steps list and estimator
pipe = Pipeline(base_steps + [estimator])
this_results, evaluation_matrix = evaluate_GS_model(
pipe=pipe,
pipeline_name=pipeline_name,
X_train=X_train,
y_train=y_train,
X_test=X_test,
y_test=y_test,
regression_params=params,
predict_test=True,
predict_train=True,
cv=StratifiedKFold(n_splits=5)
)
results_df = pd.concat([results_df, this_results], ignore_index=True)
evaluation_matrix_list.append([evaluation_matrix])
return pipe, results_df, evaluation_matrix_list
Testing pipeline on our datasets¶
pipe, results_df, evaluation_matrix_list = building_pipelines(wine_quality_without_outliers_dfs)
Evaluating ('lr', LinearRegression())
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.113, test=-0.116) neg_median_absolute_error: (train=-0.113, test=-0.117) r2: (train=0.079, test=0.089) total time= 0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.113, test=-0.118) neg_median_absolute_error: (train=-0.112, test=-0.110) r2: (train=0.082, test=0.073) total time= 0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.112, test=-0.121) neg_median_absolute_error: (train=-0.113, test=-0.102) r2: (train=0.082, test=0.068) total time= 0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.114, test=-0.115) neg_median_absolute_error: (train=-0.113, test=-0.128) r2: (train=0.093, test=0.017) total time= 0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.115, test=-0.108) neg_median_absolute_error: (train=-0.116, test=-0.099) r2: (train=0.081, test=0.080) total time= 0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.951, test=-23.974) neg_median_absolute_error: (train=-4.962, test=-4.975) r2: (train=-193.909, test=-187.441) total time= 0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.951, test=-23.943) neg_median_absolute_error: (train=-4.967, test=-4.971) r2: (train=-193.906, test=-187.196) total time= 0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.959, test=-23.811) neg_median_absolute_error: (train=-4.969, test=-4.953) r2: (train=-195.144, test=-181.995) total time= 0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.943, test=-24.073) neg_median_absolute_error: (train=-4.965, test=-4.983) r2: (train=-189.907, test=-204.653) total time= 0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.945, test=-23.924) neg_median_absolute_error: (train=-4.966, test=-4.960) r2: (train=-189.919, test=-203.378) total time= 0.0s
Evaluating ('lr', LinearRegression())
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.115, test=-0.117) neg_median_absolute_error: (train=-0.120, test=-0.126) r2: (train=0.062, test=0.083) total time= 0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.115, test=-0.119) neg_median_absolute_error: (train=-0.121, test=-0.121) r2: (train=0.067, test=0.062) total time= 0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.114, test=-0.120) neg_median_absolute_error: (train=-0.122, test=-0.112) r2: (train=0.063, test=0.075) total time= 0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.115, test=-0.117) neg_median_absolute_error: (train=-0.117, test=-0.120) r2: (train=0.080, test=-0.003) total time= 0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.116, test=-0.112) neg_median_absolute_error: (train=-0.119, test=-0.115) r2: (train=0.072, test=0.042) total time= 0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.953, test=-23.958) neg_median_absolute_error: (train=-4.974, test=-4.978) r2: (train=-193.926, test=-187.314) total time= 0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.953, test=-23.975) neg_median_absolute_error: (train=-4.970, test=-4.982) r2: (train=-193.921, test=-187.447) total time= 0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.962, test=-23.856) neg_median_absolute_error: (train=-4.976, test=-4.965) r2: (train=-195.164, test=-182.338) total time= 0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.945, test=-24.016) neg_median_absolute_error: (train=-4.966, test=-4.979) r2: (train=-189.920, test=-204.165) total time= 0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-23.946, test=-23.933) neg_median_absolute_error: (train=-4.973, test=-4.960) r2: (train=-189.928, test=-203.455) total time= 0.0s
Evaluating ('lr', LinearRegression())
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.307, test=-0.316) neg_median_absolute_error: (train=-0.394, test=-0.403) r2: (train=0.119, test=0.101) total time= 0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.305, test=-0.321) neg_median_absolute_error: (train=-0.393, test=-0.402) r2: (train=0.122, test=0.087) total time= 0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.309, test=-0.306) neg_median_absolute_error: (train=-0.395, test=-0.395) r2: (train=0.112, test=0.130) total time= 0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.311, test=-0.299) neg_median_absolute_error: (train=-0.393, test=-0.374) r2: (train=0.114, test=0.120) total time= 0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.308, test=-0.307) neg_median_absolute_error: (train=-0.393, test=-0.410) r2: (train=0.116, test=0.116) total time= 0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.059, test=-41.015) neg_median_absolute_error: (train=-6.193, test=-6.187) r2: (train=-117.012, test=-115.627) total time= 0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.052, test=-41.138) neg_median_absolute_error: (train=-6.193, test=-6.188) r2: (train=-117.029, test=-115.828) total time= 0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.055, test=-41.067) neg_median_absolute_error: (train=-6.180, test=-6.193) r2: (train=-117.039, test=-115.628) total time= 0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.073, test=-41.279) neg_median_absolute_error: (train=-6.178, test=-6.195) r2: (train=-116.042, test=-120.540) total time= 0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.065, test=-40.801) neg_median_absolute_error: (train=-6.186, test=-6.158) r2: (train=-116.699, test=-116.318) total time= 0.0s
Evaluating ('lr', LinearRegression())
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.308, test=-0.313) neg_median_absolute_error: (train=-0.402, test=-0.405) r2: (train=0.114, test=0.110) total time= 0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.306, test=-0.322) neg_median_absolute_error: (train=-0.393, test=-0.402) r2: (train=0.121, test=0.084) total time= 0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.309, test=-0.308) neg_median_absolute_error: (train=-0.403, test=-0.398) r2: (train=0.111, test=0.125) total time= 0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.312, test=-0.299) neg_median_absolute_error: (train=-0.400, test=-0.386) r2: (train=0.112, test=0.118) total time= 0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.308, test=-0.311) neg_median_absolute_error: (train=-0.400, test=-0.416) r2: (train=0.116, test=0.105) total time= 0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.061, test=-40.973) neg_median_absolute_error: (train=-6.181, test=-6.165) r2: (train=-117.017, test=-115.508) total time= 0.0s
c:\Users\zosia\AppData\Local\Programs\Python\Python311\Lib\site-packages\sklearn\model_selection\_split.py:737: UserWarning: The least populated class in y has only 4 members, which is less than n_splits=5. warnings.warn(
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.052, test=-41.174) neg_median_absolute_error: (train=-6.185, test=-6.183) r2: (train=-117.031, test=-115.931) total time= 0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.055, test=-41.141) neg_median_absolute_error: (train=-6.175, test=-6.189) r2: (train=-117.040, test=-115.838) total time= 0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.074, test=-41.249) neg_median_absolute_error: (train=-6.178, test=-6.190) r2: (train=-116.044, test=-120.454) total time= 0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-41.065, test=-40.775) neg_median_absolute_error: (train=-6.175, test=-6.164) r2: (train=-116.698, test=-116.245) total time= 0.0s
Rating ('lr', LinearRegression())
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.108, test=-0.109) neg_median_absolute_error: (train=-0.107, test=-0.098) r2: (train=0.074, test=0.168) total time= 0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.105, test=-0.120) neg_median_absolute_error: (train=-0.108, test=-0.105) r2: (train=0.098, test=0.085) total time= 0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.112, test=-0.092) neg_median_absolute_error: (train=-0.103, test=-0.115) r2: (train=0.096, test=0.097) total time= 0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.113, test=-0.089) neg_median_absolute_error: (train=-0.107, test=-0.086) r2: (train=0.092, test=0.119) total time= 0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.099, test=-0.150) neg_median_absolute_error: (train=-0.109, test=-0.104) r2: (train=0.153, test=-0.138) total time= 0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.103, test=-24.027) neg_median_absolute_error: (train=-4.979, test=-4.979) r2: (train=-205.805, test=-182.519) total time= 0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.101, test=-24.176) neg_median_absolute_error: (train=-4.978, test=-4.983) r2: (train=-205.781, test=-183.659) total time= 0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.068, test=-24.153) neg_median_absolute_error: (train=-4.970, test=-4.989) r2: (train=-193.265, test=-237.012) total time= 0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.069, test=-24.062) neg_median_absolute_error: (train=-4.980, test=-4.969) r2: (train=-193.269, test=-236.116) total time= 0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.096, test=-24.000) neg_median_absolute_error: (train=-4.965, test=-4.952) r2: (train=-206.141, test=-180.912) total time= 0.0s
Rating ('lr', LinearRegression())
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.108, test=-0.109) neg_median_absolute_error: (train=-0.108, test=-0.101) r2: (train=0.076, test=0.165) total time= 0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.105, test=-0.122) neg_median_absolute_error: (train=-0.108, test=-0.104) r2: (train=0.101, test=0.070) total time= 0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.112, test=-0.093) neg_median_absolute_error: (train=-0.104, test=-0.116) r2: (train=0.099, test=0.083) total time= 0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.112, test=-0.089) neg_median_absolute_error: (train=-0.107, test=-0.085) r2: (train=0.092, test=0.122) total time= 0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.099, test=-0.152) neg_median_absolute_error: (train=-0.105, test=-0.106) r2: (train=0.152, test=-0.151) total time= 0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.103, test=-24.045) neg_median_absolute_error: (train=-4.980, test=-4.979) r2: (train=-205.802, test=-182.654) total time= 0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.100, test=-24.131) neg_median_absolute_error: (train=-4.977, test=-4.984) r2: (train=-205.777, test=-183.316) total time= 0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.068, test=-24.181) neg_median_absolute_error: (train=-4.970, test=-4.982) r2: (train=-193.262, test=-237.280) total time= 0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.069, test=-24.052) neg_median_absolute_error: (train=-4.978, test=-4.968) r2: (train=-193.269, test=-236.016) total time= 0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-24.096, test=-24.007) neg_median_absolute_error: (train=-4.965, test=-4.958) r2: (train=-206.141, test=-180.964) total time= 0.0s
Rating ('lr', LinearRegression())
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.200, test=-0.212) neg_median_absolute_error: (train=-0.321, test=-0.284) r2: (train=0.206, test=0.126) total time= 0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.206, test=-0.187) neg_median_absolute_error: (train=-0.322, test=-0.325) r2: (train=0.183, test=0.230) total time= 0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.204, test=-0.192) neg_median_absolute_error: (train=-0.322, test=-0.271) r2: (train=0.189, test=0.207) total time= 0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.199, test=-0.217) neg_median_absolute_error: (train=-0.304, test=-0.371) r2: (train=0.193, test=0.177) total time= 0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.196, test=-0.231) neg_median_absolute_error: (train=-0.299, test=-0.324) r2: (train=0.209, test=0.111) total time= 0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.608, test=-39.506) neg_median_absolute_error: (train=-6.165, test=-6.181) r2: (train=-156.177, test=-162.140) total time= 0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.614, test=-39.401) neg_median_absolute_error: (train=-6.161, test=-6.145) r2: (train=-156.199, test=-161.706) total time= 0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.612, test=-39.814) neg_median_absolute_error: (train=-6.159, test=-6.177) r2: (train=-156.194, test=-163.413) total time= 0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.561, test=-39.274) neg_median_absolute_error: (train=-6.170, test=-6.129) r2: (train=-159.378, test=-148.131) total time= 0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.581, test=-39.966) neg_median_absolute_error: (train=-6.164, test=-6.186) r2: (train=-158.922, test=-152.619) total time= 0.0s
Rating ('lr', LinearRegression())
Fitting 5 folds for each of 2 candidates, totalling 10 fits
[CV 1/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.200, test=-0.210) neg_median_absolute_error: (train=-0.313, test=-0.292) r2: (train=0.207, test=0.132) total time= 0.0s
[CV 2/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.205, test=-0.187) neg_median_absolute_error: (train=-0.312, test=-0.326) r2: (train=0.187, test=0.229) total time= 0.0s
[CV 3/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.204, test=-0.191) neg_median_absolute_error: (train=-0.316, test=-0.266) r2: (train=0.192, test=0.211) total time= 0.0s
[CV 4/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.199, test=-0.214) neg_median_absolute_error: (train=-0.303, test=-0.361) r2: (train=0.195, test=0.187) total time= 0.0s
[CV 5/5] END lr__fit_intercept=True; neg_mean_squared_error: (train=-0.195, test=-0.231) neg_median_absolute_error: (train=-0.295, test=-0.313) r2: (train=0.213, test=0.111) total time= 0.0s
[CV 1/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.608, test=-39.494) neg_median_absolute_error: (train=-6.166, test=-6.185) r2: (train=-156.175, test=-162.092) total time= 0.0s
[CV 2/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.613, test=-39.405) neg_median_absolute_error: (train=-6.160, test=-6.151) r2: (train=-156.195, test=-161.723) total time= 0.0s
[CV 3/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.612, test=-39.811) neg_median_absolute_error: (train=-6.161, test=-6.169) r2: (train=-156.191, test=-163.399) total time= 0.0s
[CV 4/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.560, test=-39.312) neg_median_absolute_error: (train=-6.160, test=-6.119) r2: (train=-159.376, test=-148.273) total time= 0.0s
[CV 5/5] END lr__fit_intercept=False; neg_mean_squared_error: (train=-39.580, test=-39.955) neg_median_absolute_error: (train=-6.163, test=-6.188) r2: (train=-158.918, test=-152.576) total time= 0.0s
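The per-fold logs above can be reproduced with a pipeline and grid search like the following minimal sketch. The step name `lr` and the `lr__fit_intercept` grid match the logs; the synthetic data and the remaining settings (fold count aside) are assumptions for illustration only:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                             # stand-in for the chemical features
y = 5 + 0.3 * X[:, 0] + rng.normal(scale=0.5, size=200)   # quality-like target around 5

pipe = Pipeline([("scaler", StandardScaler()), ("lr", LinearRegression())])
grid = GridSearchCV(
    pipe,
    param_grid={"lr__fit_intercept": [True, False]},
    scoring=["neg_mean_squared_error", "neg_median_absolute_error", "r2"],
    refit="r2",                 # refit on the metric used to pick the winner
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
    return_train_score=True,    # needed for the (train=..., test=...) pairs in the logs
    verbose=3,                  # prints the [CV k/5] END ... lines seen above
)
grid.fit(X, y)
print(grid.best_params_)
```

Note that `fit_intercept=False` produces the large negative r² seen in the logs: after standardization the features have mean zero, so a model without an intercept cannot reproduce the mean quality (around 5-6) and is far worse than simply predicting the mean.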
Results of the model:¶
results_df
| | pipeline_name | features | coef | train score | mean_cv_score | test_score | best_model | parameters | mean_test_r2 | mean_test_neg_mean_squared_error | mean_test_neg_median_absolute_error |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Pipeline_White Wine Poor (Without Outliers) | [free sulfur dioxide, volatile acidity, total ... | [0.009804424160145675, -0.059479220409756464, ... | 0.113651 | -0.115599 | 0.136677 | (StandardScaler(), LinearRegression()) | {'lr__fit_intercept': True} | 0.065193 | -0.115599 | -0.111200 |
| 1 | Pipeline_White Wine Poor (Without Outliers) | [free sulfur dioxide, volatile acidity, residu... | [0.037149606235627075, -0.05443671297011143, 0... | 0.115442 | -0.117185 | 0.136345 | (StandardScaler(), LinearRegression()) | {'lr__fit_intercept': True} | 0.051781 | -0.117185 | -0.118854 |
| 2 | Pipeline_White Wine Good (Without Outliers) | [alcohol, density, chlorides, total sulfur dio... | [0.02335745624729898, -0.33332421629431724, -0... | 0.308168 | -0.310040 | 0.326090 | (StandardScaler(), LinearRegression()) | {'lr__fit_intercept': True} | 0.110858 | -0.310040 | -0.396855 |
| 3 | Pipeline_White Wine Good (Without Outliers) | [alcohol, chlorides, total sulfur dioxide, res... | [0.20322606485574743, -0.03970062880296661, 0.... | 0.308824 | -0.310859 | 0.328447 | (StandardScaler(), LinearRegression()) | {'lr__fit_intercept': True} | 0.108493 | -0.310859 | -0.401499 |
| 4 | Pipeline_Red Wine Poor (Without Outliers) | [volatile acidity, total sulfur dioxide, pH, f... | [-0.0984315648917267, 0.06506705592099327, -0.... | 0.107697 | -0.111995 | 0.083575 | (StandardScaler(), LinearRegression()) | {'lr__fit_intercept': True} | 0.066067 | -0.111995 | -0.101692 |
| 5 | Pipeline_Red Wine Poor (Without Outliers) | [volatile acidity, total sulfur dioxide, pH, c... | [-0.09738634750601528, 0.06258791750627782, -0... | 0.107612 | -0.113002 | 0.082503 | (StandardScaler(), LinearRegression()) | {'lr__fit_intercept': True} | 0.057932 | -0.113002 | -0.102416 |
| 6 | Pipeline_Red Wine Good (Without Outliers) | [alcohol, volatile acidity, sulphates, citric ... | [0.12993460890703012, -0.030543707212247334, 0... | 0.201659 | -0.207719 | 0.147937 | (StandardScaler(), LinearRegression()) | {'lr__fit_intercept': True} | 0.169862 | -0.207719 | -0.314700 |
| 7 | Pipeline_Red Wine Good (Without Outliers) | [alcohol, volatile acidity, sulphates, citric ... | [0.14322209473124037, -0.03049374464085257, 0.... | 0.200962 | -0.206721 | 0.147073 | (StandardScaler(), LinearRegression()) | {'lr__fit_intercept': True} | 0.173800 | -0.206721 | -0.311507 |
Discussion:¶
For each dataset we identify the most suitable model:
White Poor Wine (N=1639) (1)
- test r² = 0.136345
- Features: Volatile Acidity, Free Sulfur Dioxide, Residual Sugar, Alcohol, Total Sulfur Dioxide, Fixed Acidity, Density, Citric Acid.
White Good Wine (N=3249) (3)
- test r² = 0.328447
- Features: Alcohol, Density, Chlorides, Total Sulfur Dioxide, Residual Sugar, pH, Fixed Acidity.
Red Poor Wine (N=741) (4)
- test r² = 0.083575
- Features: Volatile Acidity, Total Sulfur Dioxide, pH, Citric Acid, Alcohol, Sulphates, Density, Residual Sugar.
Red Good Wine (N=851) (6)
- test r² = 0.147937
- Features: Alcohol, Volatile Acidity, Sulphates, Chlorides, Total Sulfur Dioxide, Residual Sugar, pH.
Comment: Overall, the models predict quality only weakly; the test r² values indicate that they explain between roughly 8% and 33% of the variance in quality.
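To make the "variance explained" reading of r² concrete, the metric is defined as r² = 1 − SS_res / SS_tot, i.e. one minus the ratio of residual variance to total variance. A small self-contained check on made-up quality scores (the numbers are illustrative, not from the wine data):

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy true and predicted quality scores, purely for illustration
y_true = np.array([5.0, 6.0, 5.0, 7.0, 6.0])
y_pred = np.array([5.2, 5.8, 5.4, 6.5, 6.1])

ss_res = np.sum((y_true - y_pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)     # total sum of squares
manual_r2 = 1 - ss_res / ss_tot

# Matches sklearn's r2_score: the fraction of variance the model explains
assert np.isclose(manual_r2, r2_score(y_true, y_pred))
print(manual_r2)
```

An r² of 0.33 therefore means the model accounts for about a third of the variation in quality scores, leaving two thirds unexplained.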
The intercepts and coefficients of the models are presented below:¶
e.g. Pipeline_White Wine Poor (Without Outliers)
evaluation_matrix_list[1][0]
| | df | const | Parameters | Coefficients |
|---|---|---|---|---|
| 0 | Pipeline_White Wine Poor (Without Outliers) | Pipeline_White Wine Poor (Without Outliers) | Intercept | 4.882263 |
| 1 | Pipeline_White Wine Poor (Without Outliers) | Pipeline_White Wine Poor (Without Outliers) | free sulfur dioxide | 0.037150 |
| 2 | Pipeline_White Wine Poor (Without Outliers) | Pipeline_White Wine Poor (Without Outliers) | volatile acidity | -0.054437 |
| 3 | Pipeline_White Wine Poor (Without Outliers) | Pipeline_White Wine Poor (Without Outliers) | residual sugar | 0.022813 |
| 4 | Pipeline_White Wine Poor (Without Outliers) | Pipeline_White Wine Poor (Without Outliers) | alcohol | -0.017102 |
| 5 | Pipeline_White Wine Poor (Without Outliers) | Pipeline_White Wine Poor (Without Outliers) | fixed acidity | -0.024791 |
| 6 | Pipeline_White Wine Poor (Without Outliers) | Pipeline_White Wine Poor (Without Outliers) | citric acid | 0.016143 |
| 7 | Pipeline_White Wine Poor (Without Outliers) | Pipeline_White Wine Poor (Without Outliers) | chlorides | -0.007053 |
| 8 | Pipeline_White Wine Poor (Without Outliers) | Pipeline_White Wine Poor (Without Outliers) | sulphates | 0.003767 |
Comment: based on the coefficients, the most important feature for predicting poor white wine quality is volatile acidity (|coef| ≈ 0.054). However, because scikit-learn's LinearRegression does not compute t statistics or p-values, we do not know whether this coefficient is statistically significant.